The World Happiness Report is a landmark survey of the state of global happiness. The happiness scores and rankings use data from the Gallup World Poll (GWP). The scores are based on answers to the main life evaluation question asked in the poll. This question, known as the Cantril ladder, asks respondents to think of a ladder with the best possible life for them being a 10 and the worst possible life being a 0, and to rate their own current lives on that scale.
Further, the Happiness Report includes six additional factors (levels of GDP, life expectancy, generosity, social support, freedom, and corruption). Each factor value shows the estimated extent to which that factor contributes to making life evaluations (the happiness score) higher in each country than in Dystopia. The underlying raw data points for those estimations are provided by other organisations (e.g. WHO) or come from the Gallup World Poll question results. Dystopia, in this context, is a hypothetical country with values equal to the world’s lowest national averages for the raw values of each of the six factors. The purpose of establishing Dystopia is to have a benchmark against which all countries can be favorably compared (no country performs more poorly than Dystopia) in terms of each of the six key variables. Since life would be very unpleasant in a country with the world’s lowest incomes, lowest life expectancy, lowest generosity, most corruption, least freedom, and least social support, it is referred to as “Dystopia,” in contrast to Utopia.
Thus, each of the six factor values expresses how much that factor contributes to a country's happiness score being higher than in Dystopia. That is why the happiness score can be calculated by: \[\text{happiness score} = \sum_{i=1}^{6} \text{factor value}_i + \text{Dystopia happiness} + \text{residual}\]
This makes it clear that the six factors are already the result of an estimation and therefore cannot be used for analysing variable importance. The resulting regression coefficients, for example, would not be helpful at all: because the residual is included in the dataset, the happiness score is an exact sum of its columns, so the intercept would be 0 and all coefficients would be 1.
That is why we looked for an additional version of the happiness dataset which includes the actual raw values and which we can therefore use for analysing variable importance and for the dimension reduction steps.
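The following is a minimal sketch of that argument on simulated numbers, not on the report's data. The column names and value ranges are made up for illustration; the only point is that regressing a sum on all of its components returns slopes of 1 and an intercept of 0, up to floating-point error.

```r
set.seed(1)
n <- 50

# Hypothetical "explained by" contributions (simulated, not the report's data)
parts <- data.frame(
  economy           = runif(n, 0, 1.5),
  family            = runif(n, 0, 1.4),
  health            = runif(n, 0, 1.0),
  dystopia_residual = runif(n, 1, 3)
)

# The score is defined as the exact sum of all contribution columns
parts$happiness <- rowSums(parts)

# Regressing the sum on its own components carries no information:
# every slope is 1 and the intercept is 0 (up to floating-point error)
fit <- lm(happiness ~ ., data = parts)
print(round(coef(fit), 6))
```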
Based on the happiness dataset, we want to answer the following leading questions.

### What influences happiness?

Can happiness be explained by certain factors? What are those factors, and how much do they influence happiness? For these questions we need the raw values to build our analysis on. To answer them, we decided to add further factors which might explain the different happiness levels. We were interested in how drug abuse correlates with happiness and found suitable datasets for alcohol consumption and tobacco consumption. Additionally, we were interested in how the modern use of social media influences happiness. However, we only found a fitting internet dataset, which captures the percentage of individuals in a country who use the Internet.

### Happiness over time?

For the change of happiness we can use the plain happiness dataset, as it captures the happiness scores and the “explained by” parts for the six factors over time. Therefore we can calculate and visualize the changes over time.
| Country | Region | Happiness.Rank | Happiness.Score | Standard.Error | Economy..GDP.per.Capita. | Family | Health..Life.Expectancy. | Freedom | Trust..Government.Corruption. | Generosity | Dystopia.Residual |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Switzerland | Western Europe | 1 | 7.587 | 0.03411 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.41978 | 0.29678 | 2.51738 |
| Iceland | Western Europe | 2 | 7.561 | 0.04884 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.14145 | 0.43630 | 2.70201 |
| Denmark | Western Europe | 3 | 7.527 | 0.03328 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.48357 | 0.34139 | 2.49204 |
| Country | Happiness.Rank | Happiness | Economy | Family | Health | Freedom | Trust | Generosity | Year | Region |
|---|---|---|---|---|---|---|---|---|---|---|
| Switzerland | 1 | 7.587 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.41978 | 0.29678 | 2015 | Western Europe |
| Iceland | 2 | 7.561 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.14145 | 0.43630 | 2015 | Western Europe |
| Denmark | 3 | 7.527 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.48357 | 0.34139 | 2015 | Western Europe |
To answer the question “What influences happiness?” we had to use the raw data of the factors and not their “explained by” values. In addition, we wanted to add further factors and added the following three datasets:
By merging the datasets we now have four additional factors.
To join all the different datasets we had to do some preprocessing, which can be seen in the preprocessing step. The main steps were cleaning the data (region, countrycode, NaN) and joining the datasets based on the year and the countrycode.
After joining, we noticed that the three additional datasets do not contain data for the whole timespan 2015-2022 (fig. “missing values full data”). Therefore, we decided to use only one year for analysing the influential factors.
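As a rough sketch of the joining step, assuming each dataset carries a country code and a year column: the column names `code`, `year`, `happiness` and `alcohol` and all values below are illustrative, not taken from the actual preprocessing. A left join keeps every happiness row and leaves `NA` where the additional dataset has no match.

```r
# Toy stand-ins for the happiness data and one additional dataset
happiness <- data.frame(
  code      = c("ALB", "ARG", "ARM"),
  year      = c(2018, 2018, 2018),
  happiness = c(5.0, 5.8, 5.1)
)
alcohol <- data.frame(
  code    = c("ALB", "ARG", "XXX"),   # "XXX" has no partner in `happiness`
  year    = c(2018, 2018, 2018),
  alcohol = c(7.2, 9.7, 1.0)
)

# Left join on country code and year: all happiness rows are kept,
# countries missing from `alcohol` get NA in the new column
joined <- merge(happiness, alcohol, by = c("code", "year"), all.x = TRUE)
print(joined)
```

Using `all.x = TRUE` rather than an inner join makes the gaps visible, which is what allows the missing values to be inspected per year afterwards.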
missing values full data
We inspected the missing values of each year and chose the year with the fewest missing values, 2018 (fig. “missing values 2018”). We then excluded all remaining rows containing missing values. Figure “missing values 2017” shows, for example, that the smoking and alcohol datasets did not contain any values for the year 2017. We also renamed the columns to have shorter labels.
The final influential factors dataset consists of 96 rows (countries for the year 2018) and 18 columns. The first rows are shown below. A more detailed explanation of the variables can be found in the Statistical Appendix of the World Happiness Report.
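The year selection can be sketched as follows. The data frame `full` and its values are invented for illustration; only the procedure (count missing values per year, pick the year with the fewest, then drop incomplete rows) mirrors the description above.

```r
# Toy joined dataset with gaps (values are made up)
full <- data.frame(
  year    = c(2017, 2017, 2018, 2018, 2019, 2019),
  alcohol = c(NA,   NA,   7.2,  9.7,  NA,   5.5),
  tobacco = c(NA,   NA,   29.2, NA,   21.8, NA)
)

# Count missing cells per year across all value columns
value_cols  <- setdiff(names(full), "year")
na_per_year <- tapply(rowSums(is.na(full[value_cols])), full$year, sum)
print(na_per_year)

# Pick the year with the fewest missing values and keep only complete rows
best_year <- names(na_per_year)[which.min(na_per_year)]
complete  <- na.omit(full[full$year == as.numeric(best_year), ])
print(complete)
```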
| Country | Region | Year | Happiness | Economy | Social | Health | Freedom | Generosity | Corruption | Positive | Negative | Government | Code | Alcohol | Population | Tobacco | Internet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Albania | Central and Eastern Europe | 2018 | 5.004403 | 9.412399 | 0.6835917 | 68.7 | 0.8242123 | 0.0053850 | 0.8991294 | 0.7132996 | 0.3189967 | 0.4353380 | ALB | 7.17 | 2882735 | 29.2 | 65.40000 |
| Argentina | Latin America and Caribbean | 2018 | 5.792797 | 9.809972 | 0.8999116 | 68.8 | 0.8458947 | -0.2069366 | 0.8552552 | 0.8203097 | 0.3205021 | 0.2613523 | ARG | 9.65 | 44361150 | 21.8 | 77.70000 |
| Armenia | Commonwealth of Independent States | 2018 | 5.062449 | 9.119424 | 0.8144490 | 66.9 | 0.8076437 | -0.1491087 | 0.6768264 | 0.5814877 | 0.4548403 | 0.6708276 | ARM | 5.55 | 2951741 | 26.7 | 68.24505 |
missing values 2017
missing values 2018
One of the objectives of preliminary data analysis is to get a feel for the data you are dealing with by describing its key features and summarizing the results. We focus on the second dataset, the influential factors dataset, which includes the raw values and not the “explained by” values.
First we check via the summary how all the explanatory variables are distributed. As we can see, they are on different scales, especially “Health”, “Population” and “Internet”. Because we do not want the dimension reduction analysis to be dominated by the variables with the largest ranges, we standardise each variable by \(\frac{x - mean(x)}{sd(x)}\)
## Happiness Economy Social Health
## Min. :3.335 Min. : 6.630 Min. :0.5035 Min. :48.20
## 1st Qu.:4.702 1st Qu.: 8.570 1st Qu.:0.7396 1st Qu.:59.85
## Median :5.536 Median : 9.669 Median :0.8581 Median :66.80
## Mean :5.597 Mean : 9.394 Mean :0.8220 Mean :65.23
## 3rd Qu.:6.340 3rd Qu.:10.346 3rd Qu.:0.9130 3rd Qu.:71.20
## Max. :7.858 Max. :11.454 Max. :0.9660 Max. :75.00
## Freedom Corruption Generosity Positive
## Min. :0.5286 Min. :0.1506 Min. :-0.33638 Min. :0.4347
## 1st Qu.:0.7245 1st Qu.:0.6849 1st Qu.:-0.14312 1st Qu.:0.6427
## Median :0.8084 Median :0.7989 Median :-0.02550 Median :0.7353
## Mean :0.7945 Mean :0.7255 Mean :-0.01767 Mean :0.7114
## 3rd Qu.:0.8784 3rd Qu.:0.8559 3rd Qu.: 0.07377 3rd Qu.:0.8000
## Max. :0.9699 Max. :0.9520 Max. : 0.49938 Max. :0.8836
## Negative Government Alcohol Population
## Min. :0.1580 Min. :0.07971 Min. : 0.019 Min. :6.042e+05
## 1st Qu.:0.2132 1st Qu.:0.33120 1st Qu.: 4.280 1st Qu.:6.028e+06
## Median :0.2749 Median :0.50385 Median : 7.410 Median :1.585e+07
## Mean :0.2845 Mean :0.50944 Mean : 7.221 Mean :5.380e+07
## 3rd Qu.:0.3509 3rd Qu.:0.64084 3rd Qu.:10.570 3rd Qu.:5.042e+07
## Max. :0.5438 Max. :0.98812 Max. :15.090 Max. :1.353e+09
## Tobacco Internet
## Min. : 4.60 Min. : 8.00
## 1st Qu.:13.90 1st Qu.:30.80
## Median :22.80 Median :68.25
## Mean :22.21 Mean :59.34
## 3rd Qu.:27.95 3rd Qu.:81.62
## Max. :45.50 Max. :97.32
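The standardisation described above can be sketched in a few lines. The toy data frame `x` is illustrative only; base R's `scale()` applies the same formula column by column, so both versions should agree.

```r
# Two toy variables on very different scales (values are illustrative)
x <- data.frame(
  health   = c(48.2, 66.8, 75.0),
  internet = c(8.0, 68.25, 97.32)
)

# Manual z-score: subtract the column mean, divide by the column sd
manual <- as.data.frame(lapply(x, function(col) (col - mean(col)) / sd(col)))

# scale() centres and scales each column in the same way
builtin <- scale(x)

print(manual)
print(round(colMeans(builtin), 10))  # each column now has mean 0
print(apply(builtin, 2, sd))         # and standard deviation 1
```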
We can see that every factor is now on the same scale. We have some outliers for Corruption, Generosity and Population.
In the correlation matrix plot we see that happiness has the strongest correlation with Economy (0.801), Internet (0.786), Social (0.768) and Health (0.767). For the correlations between the explanatory variables the following stand out:
One tool for getting a first look at what influences happiness is linear regression. For the regression we use the unscaled data. If our linear model has good predictive power, we can interpret the coefficients in terms of how they influence the outcome. This is also called regression analysis, where the goal is to isolate the relationship between each explanatory variable and the outcome variable.
However, this interpretation assumes that the value of one explanatory variable can be changed while the others are held fixed. This is only realistic if the explanatory variables are not correlated with each other. If this independence does not hold, we have a problem of multicollinearity. It can cause the coefficients to swing wildly depending on which other explanatory variables are in the model. The coefficients then become very sensitive to small changes in the model and cannot be easily interpreted.
One way to assess how strongly the explanatory variables are affected by multicollinearity is the variance inflation factor (VIF). The VIF of a variable measures how well it can be predicted from the other explanatory variables. A VIF of 1 means no correlation, VIFs between 1 and 5 suggest a moderate correlation, and VIFs greater than 5 represent critical levels of multicollinearity where the coefficients are poorly estimated.
If we build a linear regression model on all explanatory variables, we get an R-squared of 0.8063. However, by plotting the VIF values we can see that a model based on all explanatory variables has severe multicollinearity. Therefore we cannot interpret the coefficients for Internet, Health and Economy.
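To make the definition concrete, here is a small base R sketch that computes VIFs by hand as \(1 / (1 - R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on all other predictors. The helper `vif_manual` and the simulated variables are assumptions for illustration; they are not the function or the data used for the plots in this report.

```r
# Hypothetical helper: VIF_j = 1 / (1 - R_j^2), with R_j^2 taken from
# the regression of predictor j on all remaining predictors
vif_manual <- function(data) {
  sapply(names(data), function(v) {
    fit <- lm(reformulate(setdiff(names(data), v), response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Simulated predictors: `internet` is built to be strongly correlated
# with `economy`, while `tobacco` is independent of both
set.seed(42)
n <- 100
economy  <- rnorm(n)
internet <- economy + rnorm(n, sd = 0.2)
tobacco  <- rnorm(n)
predictors <- data.frame(economy, internet, tobacco)

vifs <- vif_manual(predictors)
print(round(vifs, 1))
```

In this simulation the two correlated variables end up well above the threshold of 5, while the independent one stays close to 1.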
##
## Call:
## lm(formula = Happiness ~ ., data = not_scaled_data_factors)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.60190 -0.24719 0.00124 0.28565 1.79684
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.527e+00 1.487e+00 -1.027 0.3076
## Economy 3.317e-01 1.628e-01 2.037 0.0449 *
## Social 3.251e+00 9.952e-01 3.266 0.0016 **
## Health 7.641e-03 1.971e-02 0.388 0.6993
## Freedom 1.404e+00 8.833e-01 1.589 0.1159
## Corruption -1.247e+00 4.577e-01 -2.724 0.0079 **
## Generosity 7.633e-01 4.282e-01 1.783 0.0784 .
## Positive 6.045e-01 7.901e-01 0.765 0.4465
## Negative 2.332e+00 9.192e-01 2.537 0.0131 *
## Government -9.855e-01 4.520e-01 -2.180 0.0321 *
## Alcohol -3.825e-03 1.898e-02 -0.202 0.8407
## Population -3.861e-10 4.241e-10 -0.910 0.3654
## Tobacco -1.165e-02 7.295e-03 -1.596 0.1143
## Internet 6.001e-03 7.110e-03 0.844 0.4012
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.541 on 81 degrees of freedom
## Multiple R-squared: 0.8063, Adjusted R-squared: 0.7752
## F-statistic: 25.94 on 13 and 81 DF, p-value: < 2.2e-16
If we build a linear regression model without Internet and Economy, we get an R-squared of 0.7745. This is lower than before, but the VIF plot shows that all VIF values are now below 5, so we can interpret the coefficients of the remaining explanatory variables.
Interestingly, only Social, Health, Corruption, Negative and Government are statistically significant:
##
## Call:
## lm(formula = Happiness ~ . - Internet - Economy, data = not_scaled_data_factors)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.63299 -0.30363 -0.02198 0.34810 2.08143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.609e+00 1.357e+00 -1.186 0.238858
## Social 4.646e+00 9.788e-01 4.746 8.54e-06 ***
## Health 5.214e-02 1.626e-02 3.207 0.001908 **
## Freedom 8.769e-01 9.265e-01 0.946 0.346660
## Corruption -1.616e+00 4.667e-01 -3.463 0.000847 ***
## Generosity 4.041e-01 4.430e-01 0.912 0.364406
## Positive 8.449e-01 8.287e-01 1.020 0.310893
## Negative 1.927e+00 9.682e-01 1.990 0.049879 *
## Government -1.016e+00 4.768e-01 -2.132 0.035974 *
## Alcohol 4.612e-03 2.003e-02 0.230 0.818514
## Population -1.135e-10 4.242e-10 -0.268 0.789649
## Tobacco -8.494e-03 7.700e-03 -1.103 0.273137
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5766 on 83 degrees of freedom
## Multiple R-squared: 0.7745, Adjusted R-squared: 0.7446
## F-statistic: 25.92 on 11 and 83 DF, p-value: < 2.2e-16
Next we tried out a linear regression method with shrinkage. With lasso regression, some estimates can become exactly zero. The result is therefore a type of variable selection, which makes the model sparse and easier to interpret. For lasso regression all predictor variables should be scaled so that they have the same standard deviation. Otherwise, the predictor variables are weighted unequally in the penalty term. The glmnet() function, however, standardizes the predictors by default, and the output coefficients are converted back to the original scale.
## [1] "Lasso Regression"
## 12 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -0.14935822
## Social 3.40444126
## Health 0.04983356
## Freedom .
## Corruption -0.73108727
## Generosity .
## Positive 0.32029544
## Negative .
## Government .
## Alcohol .
## Population .
## Tobacco .
The results of the lasso regression confirm our results from the ordinary regression for Social, Health and Corruption. However, Positive is added to the model, and Negative and Government are removed from it.
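Why lasso estimates can become exactly zero is easiest to see in a small hand-written version. The sketch below is a plain coordinate-descent lasso in base R on simulated data. It is not glmnet's implementation, and the names `soft_threshold` and `lasso_cd`, the penalty value and the data are all assumptions for illustration. Each coefficient update passes through a soft-threshold, which sets it to exactly zero whenever its partial correlation with the residual is smaller than the penalty.

```r
# Soft-thresholding: shrink towards zero, and return exactly zero
# whenever |z| <= gamma
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

# Minimal coordinate descent for
#   (1 / (2n)) * sum((y - X b)^2) + lambda * sum(abs(b))
# on standardised predictors and a centred response
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  X <- scale(X)        # same standard deviation for every predictor
  y <- y - mean(y)     # centring removes the need for an intercept
  n <- nrow(X)
  beta <- rep(0, ncol(X))
  for (it in seq_len(n_iter)) {
    for (j in seq_along(beta)) {
      # Residual with predictor j left out of the current fit
      r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]
      z <- sum(X[, j] * r_j) / n
      beta[j] <- soft_threshold(z, lambda) / (sum(X[, j]^2) / n)
    }
  }
  setNames(beta, colnames(X))
}

# Simulated data: only x1 truly influences y
set.seed(7)
n <- 200
X <- cbind(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 2 * X[, "x1"] + rnorm(n, sd = 0.5)

beta <- lasso_cd(X, y, lambda = 0.3)
print(beta)
```

The coefficients here are on the standardised scale. glmnet additionally converts them back to the original scale and fits a whole path of penalty values.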
(Colour by region)
par(mar = c(4, 4, .1, .1))
pca <- prcomp(scaled_data_factors)
pca$rotation[,1:2]
## PC1 PC2
## Happiness 0.36333118 -0.07760930
## Economy 0.38726934 0.08154853
## Social 0.37119298 0.03620142
## Health 0.36968985 0.07549399
## Freedom 0.17431219 -0.46478217
## Corruption -0.19249458 0.37715677
## Generosity -0.09369932 -0.40223701
## Positive 0.17977494 -0.37178287
## Negative -0.28717674 0.08337116
## Government -0.11538614 -0.46978046
## Alcohol 0.26368478 0.07250532
## Population -0.06474133 -0.07959730
## Tobacco 0.13806804 0.25934932
## Internet 0.38269084 0.12462216
# Share of the total variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
# Colour by happiness quartile (the binning is an assumption) so the legend has four entries instead of one per country
happiness_bin <- cut(not_scaled_data_factors$Happiness, quantile(not_scaled_data_factors$Happiness, 0:4 / 4), include.lowest = TRUE)
plot(pca$x[, 1:2], xlab = paste("PC1:", round(var_explained[1], 2)), ylab = paste("PC2:", round(var_explained[2], 2)), col = as.integer(happiness_bin), pch = 16)
legend("topright", legend = levels(happiness_bin), pch = 16, col = seq_along(levels(happiness_bin)), title = "Happiness")
SOM Fanplot (2015)
Mappings for SOM
geography map (color each country base on the percentage change over time (2015-2022))
#box <- ggplot(data_2018, aes(x = Region, y = Happiness, color = Region), ) +
# geom_boxplot() +
# geom_jitter(aes(color=Country), size = 0.5) +
# ggtitle("Happiness Score for Regions and Countries") +
# coord_flip() +
# theme(legend.position="none")
#ggplotly(box)